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Abstract 

We develop a multi-level restricted Gaussian maximum likelihood method for estimat¬ 
ing the covariance function parameters and computing the best unbiased predictor. Our ap¬ 
proach produces a new set of multi-level contrasts where the deterministic parameters of the 
model are filtered out thus enabling the estimation of the covariance parameters to be decou¬ 
pled from the deterministic component. Moreover, the multi-level covariance matrix of the 
contrasts exhibit fast decay that is dependent on the smoothness of the covariance function. 
Due to the fast decay of the multi-level covariance matrix coefficients only a small set is com¬ 
puted with a level dependent criterion. We demonstrate our approach on problems of up to 
512,000 observations with a Matern covariance function and highly irregular placements of 
the observations. In addition, these problems are numerically unstable and hard to solve with 
traditional methods. 
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1 Introduction 


Consider the following model for a Gaussian spatial random field Z: 

Z(s) = m(s)^)S + e(s), s G (1) 

where m G R^ is a known function of the spatial location s, jS G R^ is an unknown vec¬ 
tor of coefficients, and e is a stationary mean zero Gaussian random field with parametric 
covariance function C(s, s';0) = cov{e(s),e(s')} having an unknown vector 9 G R”' of pa¬ 
rameters. We observe the data vector Z = (Z(si),...,Z(s„))^ at locations S := {si,..., s„}, 
where Si 7 ^ S 2 7 ^ S 3 7 ^ • • • 7 ^ s„_i 7 ^ s„, and wish to: 1 ) estimate the unknown vectors )S and 
0; and 2 ) predict Z(so), where sq is a new spatial location. These two tasks are particularly 
challenging when the sample size n is large. 

To address the estimation part, let C(9) = cov(Z, Z^) G R"^” be the covariance matrix of 
Z and assume it is nonsingular for all 9 G R™". Define M = (m(si)... m(sn)) G R”^P and 
assume it is of full rank, p. The model (1) leads to the vectorial formulation 

Z = MjS + e, (2) 

where e is a Gaussian random vector, e ^ A/'n(0, C(9)). Then the log-likelihood function is 

= -5log(27r) - llogdet{C(0)} - i(Z - M|8)’’C(e)-‘(Z - Mfi), (3) 

which can be profiled by generalized least squares with 

jS(0) = {M^C(9)-'^M}-^M^C(9)-^Z. (4) 

A consequence of profiling is that the maximum likelihood estimator (MLE) of 9 then tends 
to be biased. A solution to this problem is to use restricted maximum likelihood (REML) 
estimation which consists in calculating the log-likelihood oin — p linearly independent con- 
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trasts, that is, linear combinations of observations whose joint distribution does not depend 
on )S, from the set Y = {!„ — M(M^M)“^M^}Z. In this paper, we propose a new set of con¬ 
trasts that lead to significant computational benefits (with good accuracy) when computing 
the REML estimator of 9 for large sample size n. 

To address the prediction part, consider the best unbiased predictor ^(sq) = Aq + A^Z 
where A = (Ai,..., A„)^. The unbiasedness constraint implies Aq = 0 and M^A = m(so). The 
minimization of the mean squared prediction error E[{Z(so) — A^Z}^] under the constraint 
M^A = m(so) yields 

Z(so) = m(so)^^ + c(e)^C(0)-\Z - M^), (5) 

where c(0) = cov{Z, Z(so)} G IR” and jS is defined in (4). In this paper, we propose a new 
transformation of the data vector Z leading to a decoupled multi-level description of the 
model (1) without any loss of structure. This multi-level representation leads to significant 
computational benefits when computing the kriging predictor Z(so) in (5) for large sample 
size n. 

Previous work has been performed to maximize (3). The classical technique is to compute 
a Cholesky factorization of C. However, this requires 0{rp-) memory and 0{n^) computa¬ 
tional steps, thus impractical for large scale problems. 

Under special structures of the covariance matrix, i.e., fast decay of the covariance func¬ 
tion, a tapering technique can be used to sparsify the covariance matrix and thus increase 
memory and computational efficiency (Eurrer et al. (2006); Kaufman et al. (2008)). These tech¬ 
niques are good when applicable but tend to be restrictive. Eor a review of various approaches 
to spatial statistics for large datasets, see Sun et al. (2012). 

Recently we have seen the advent of solving the optimization problem (3) from a compu¬ 
tational numerical perspective. Anitescu et al. (2012) developed a matrix-free approach for 
computing the maximum of the log-likelihood (3) based on a stochastic programming refor- 
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mulation. This method relies on Monte Carlo approximation of the derivative of the score 
function with respect to the covariance parameters d to compute the maximization (3). The 
authors show promising results for a grid geomefry of the placement of the observations. 
However for a non-grid geomefry the cost of computing the preconditioner becomes 0{rp-) 
and it is not clear how many iterations for convergence are needed as the geometry deviates 
from a grid. Moreover, due fo the slow convergence rate of the Monte Carlo method 
convergence rate where 7/ is the number of realizations) many samples might be required be¬ 
fore a suitable estimate is obtained. The previous work was extended in Stein et al. (2013). 
Although the results are impressive (1,000,000 -i- size problems), the approach is restricted to 
regular grid geometries with partially occluded areas. 

Stein et al. (2012) presented a difference filter preconditioning for large covariance matri¬ 
ces not unlike our multi-level method. By constructing a preconditioner based on the dif¬ 
ference fiber the number of iferafions of a Preconditioned Conjugate Gradient (PCG) drops 
significantly. However, the authors can only construct a preconditioner for irregularly placed 
observations in ID and for a regular grid in higher dimension. Moreover, the authors point 
out that the restrictions on the spectral density of the random field Z are strong. 

In Stein et al. (2004) the authors proposed a REML method in combination with an approx¬ 
imation of the likelihood. This approach uses a truncation method to compute an approxi¬ 
mation of the likelihood function. It appears to be effective if the truncated terms have small 
correlations. However, if the covariance function has a slow decay then we expect that this 
approximation will not be accurate unless a large neighborhood is incorporated. Moreover, 
this paper does not include an analysis of the error with respect to the truncation. 

In Sun and Stem (2015) the authors proposed new unbiased estimating equations based on 
score equation approximations. The inverse covariance matrix is approximated with a sparse 
inverse Gholesky decomposition. As in Stem et al. (2004) the approximation is expected to 
be fast and accurate for locally correlated observations but will suffer from slow decay of the 
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covariance function. Moreover, the results are limited to grid-like geometries. 

In the next section we present the basic ideas behind our approach. In Section 3 we show 
the construction of a multi-level basis from the observations points. In Section 4 we describe 
how to efficiently construct a multi-level covariance matrix that arises from the new basis. In 
Section 5 a multi-level estimator is proposed. In Section 6 the multi-level kriging approach is 
described. In Section 7 hard to solve numerical examples are provided and compared with 
traditional methods. In Section 8 we give concluding remarks. Proofs are relegated to the 
Appendix A and a notation summary can be found in Appendix B. We also include compu¬ 
tational and mathematical details in the remarks. However, these may be skipped on a first 
reading except for the more mathematically oriented reader. 

2 Multi-Level REML and Kriging Basic Approach 

We now present the main ideas of our proposal. Denote by (S) the span of the columns 
of the design matrix M. Let L G be an orthogonal projection from R” to V^{S) and 

W G R(”- -p)xw orthogonal projection from R” to 7^P(S)"’", the orthogonal complement 

r w 1 

of (S). Moreover we assume that the operator ^ is orthonormal. 

By applying the operator W to (2) we obtain Zw = WZ = W(M)S -|- e) = We. Our 
first observation is that the trend contribution MjS is filtered out from the data Z. We can 
now formulate the estimation of the covariance parameters d without the trend. The new 
log-likelihood function becomes 

<w(0) = -^log(27i) - ilogdel{C„(e)} - izT c„(0)-'z„, (6) 

where Cw(0) = WC(0)W^ and Z^v ^ A/'„-p(0, WC(0)W^). As shown in Section 5 , to esti¬ 
mate the coefficients 9 it is not necessary to compute iw{^) but a multi-resolution version. 

A consequence of the filtering is that we obtain an unbiased estimator. Moreover, a further 
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consequence is that if v 7^ 0 then 


0 < min 

v6K” 


v^c{e)v 

llvip 


< 


v^c(e)v 

mm — " 

V6]R”\PP(S) ||v||"^ 


< 


v^c(e)v 

max — I - 

V6]R"\PP(S) ||v||"^ 


< max 
v6R” 


vTc(0)v 

llviP 


(7) 


This implies that the condition number of Cw(0) is less than or equal to the condition number 
of C(0). Thus computing the inverse of Cw(0) will be in general more stable than for C(0). 
In practice, computing the inverse of Cw(0) will be much more stable than C{6) (See the 
results in Tables 4 and 5). High condition number are very bad for numerical methods in 
general. In general any numerical method will suffer if the condition number is high. Finally, 
the uncertainties in the parameter estimates obtained from (6) can be quantified using the 
Godambe information matrix as described in Sect. 2 and Appendix B of Stein et al. (2004). 

As shown in Section 4, for covariance functions that are differentiable up to a degree / + 1 
(except at the origin), such as the Matern, our approach leads to covariance matrices Cw 
where most of the coefficients are small and thus can be safely eliminated. We construct a 
level dependent criterion approach to determine which entries are computed and the rest are 
set to zero. With this approach we can now construct a sparse covariance matrix Cw that 
is close to Cw in a matrix norm sense even if the observations are highly correlated with 
distance. 

The sparsity of Cw will depend on the following: i) a positive integer t, which is a multi¬ 
level distance criterion; ii) a positive integer /, which is the degree of the multi-level basis 
and associated accuracy parameters p; and iii) the smoothness of the covariance function. The 
accuracy of will depend monotonically on these parameters, i.e., as we increase t and / 
(and respectively p) the matrix Cw will be closer to Cw in a norm sense. This is explained in 
detail in Section 4. 

The choice of the projectors L and W will determine how efficiently each likelihood func¬ 
tion (6) evaluation is solved. Indeed, we desire the transformation to have the following properties: 
i) Stability: The matrices L and W have orthogonal rows and the stacked matrix [L; W] is 
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orthonormal; ii) Fast computation: The computational cost of applying the matrix [L; W] to a 
vector is 0(n(logn)^) for some small integer iii) Fast log determinant computation: The 
computational cost of computing logdet{Cw(d)} to be bounded by in 2D and 0{n^) 

in 3D. We also want to restrict the memory storage to 0(n(logn)^); iv) Fast inversion: The 
computational cost of computing Cw(0)“^Zvv to a desired accuracy e is better than 0{rP-). 
Memory storage is also desirable to be restricted to 0(n(logn)^); v) Accuracy: Determinant 
computation and inversion are also required to be accurate. We achieve the properties i) - v) 
in this paper. 

In Section 3 we describe how to construct multi-level matrices L and W that satisfy prop¬ 
erties i) and ii) for most practical observation location placements (random for example). We 
apply L and W to construct the sparse multi-level covariance matrix Cw(0)- The determinant 
of the multi-level sparse covariance matrix Cw(d) and the term Qy^{Q)~^Ty^ are computed by 
exploiting an accurate sparse Cholesky representation of Cw (properties iii) and v)). The term 
Cw (0)-iZ vv can also be computed by applying a Preconditioned Conjugate Gradient (PCG) 
to a desired accuracy (properties iv) and v)). In Section 6 the multi-level kriging method is de¬ 
scribed. In Section 7 we demonstrate the efficiency of our method for numerous covariances 
and irregularly placed observations. We are able to solve the fast inversion for up to 512,000 
observations to a relative accuracy of 10“^ with respect to the unpreconditioned system. It is 
important to note that the achieved accuracy of preconditioned system will not necessarily 
imply accuracy of unpreconditioned system if the condition number of the preconditioner is 
high. Furthermore, we test our approach to estimate the covariance parameters of problems 
of up to 128,000 observations. In addition, the accuracy of the kriging estimates are tabulated 
for different size problems. 
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3 Multi-Level Basis 


In this section we establish the general structure of the Multi-Level Basis (MB) that is used 
solve the estimation and prediction problem. We refer the reader to Castrillon-Candas et al. 
(2013) for a detailed description. The MB can then be used to: (i) form the multi-level REML 
function (6); (ii) sparsify the covariance matrix Cw{d)} and (iii) improve the conditioning over 
the covariance matrix C(0). But first, we establish some notation and definitions: 


• Let a := (ai,...,^^) G Z^, |a| := ai + ■ ■ ■ + aa, x := [xi,..., Xa]^ and := 
For any h G Nq (where Nq := N U {0}) let be the set of monomials 
I \x\ < h}. Furthermore, let be the design matrix with respect to all the 
monomials in Qj^. The number of monomials of degree h with dimension d is 



• We shall restrict the Gaussian spatial random field (1) design matrix M to M^, where 
/ is the degree of the model. Thus p will be equal to the number of monomials in Q^, 
which is p := 



Let/ > f be the degree of the multi-level basis and M ^ the associated design matrix. The 

'd + f 

f 


f 

number of monomials in Q‘j shall be referred as the accuracy parameter p := 


These parameters are chosen by the user and are used to construct the multi-level basis. 


• Let C(0) := {(p{rij;6)} where (p is the covariance function, := ||s; — sy||2 and S;, sy G 

for i,j = 1,..., n. Alternatively we refer to /(x, y; 6) as (p{r; 6), where r := ||x — y ||2 
and x,y G R^. Suppose V^{S) is the span of the columns of the design matrix My:. We 
now assume that (p{r; 6) is a positive definite function and (R) for all r G R except 
at the origin. 

• For any index i,j G Nq, 'i<i<n,l<j<n, let e^/] = d[i — j], where is the discrete 
Kronecker delta function. 
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Remark 1 In practice instead of using the set of monomials Q‘j we use the set of Chebyshev polyno¬ 
mials of the first kind as these lead to a more stable numerical scheme. However, for simplicity of the 
presentation we keep it to monomials. 

The first step is to decompose the locations into a series of multi-level cubes of dimension 
d. Without loss of generality we assume that all the locations are contained in a unit cube Bq at 
level 0 and index 0. If the number of locations inside Bg is more than p then equally subdivide 
Bq into 2'^ cubes (Bg, where d is the number of dimensions. It the number of 

locations is p or less then stop, associate every location with the cube Bg and denote this as 
a leaf cube. Otherwise, for each non-empty cube B^ at level q = 1 and index k if the number 
of locations is more than p then subdivide, otherwise associate all the locations to B^ and 
denote this as a leaf cube. This process is repeated for all the subdivided cubes at levels 
q = 2,..., until no subdivisions are possible. The result is a tree structure with 0,...,f levels 
(See Algorithm 1 in Castrillon-Candas et al. (2013) for more details). We denote the leaf cubes 
as all the non-empty cubes that contain at most p locations, i.e., they will correspond to the 
leafs of the tree structure. 

Remark 2 For practical cases, t increases proportionally to logn. If the inter location spacing col¬ 
lapses as n~t, where q is independent of n , then qlogn levels are needed, see Section 4 in Beatson 
and Greengard (1997) for details. 

Suppose that there is a one-to-one mapping between the set of unit vectors := {ei,.. .,6 ^}, 
which we denote as leaf unit vectors, and the set of locations {si,..., s„}, i.e. S; ^^ for 
all i = 1,..., n. It is clear that the space of {ei,..., 6 ^} is IR". The next step is to replace £ 
with a new basis of IR” that is multi-level, orthonormal and gives us the desired properties 
i) and ii) from Section 1. In Castrillon-Candas et al. (2013) the reader can find full details on 
such a construction. However, for the sake of clarity for the rest of the paper we associate the 
multi-level domain decomposition to the multi-level basis: 
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Figure 1: Multi-level domain decomposition of the location of observations for d = p = 2. All 
the observation locations s, are assumed to be contained in the unit cube Bg (colored red node 
in tree). If fhe number of observations are greater than p = 2 we subdivide into equal boxes. 
At the end we obtain a multi-level decomposition of fhe observation points. 
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• For each non empty cube for i = 0,... ,t, associate a series of multi-level basis vectors 

...} that have the following property: 

ki /C2 

= E = 0/ (8) 

’ a=l ’ 

for j = 0,... and for all the vectors g that are columns of M j. Furthermore, let be a 
matrix ...]. 

• For i = 0,..., f let W; := W''i,... ] 

• If p > p we will have an extra p — p vectors corresponding to a level —1 for the initial 

cube Bq. Now, associate p — p multi-level vectors • • •, V’p-p } that have the 

following property: 

gT^-FO ^ ^ g[a]tpj^’^[a] = 0, (9) 

a—1 

for j = 1,..., p — p and for all the vectors g that are columns of My. Similarly as above, 
letW_i := . 

• In total we will have n — p multi-level vectors and the transform matrix W G 
is built as W:= [Wj,.. .,Wo,W_i]^. 

• Now, it is clear that Wg = 0 for any g G My. To complete the basis to span R” we 
need p more orthonormal vectors. In Castrillon-Candas et al. (2013) it is shown how to 
compute such a basis and stack the vectors as rows in the matrix L G RP^”. 


With the construction of W and L we will have the following properties: a) the matrix 

r w 1 T 

P := is orthonormal, i.e., PP^ = I„; b) any vector v G IR” can be written as v = 

L 

L^vl -|- W^vw where v^ G IR^ and v^v G R”“P are unique; c) the matrix W contains at most 
0{nt) non-zero entries and L contains at most 0{np) non-zero entries. This implies that for 
any vector v G R” the computational cost of applying Wv is at most 0{nt) and Lv is at most 
0{np). 
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4 Multi-Level Covariance Matrix 


In this section we show how we can use the matrix W to produce a highly sparse representa¬ 
tion of Cw(0) with a level-dependent tapering technique. 

With the MB we can transform the observation data vector Z by applying the matrix W. 
This leads to the multi-level log-likelihood function (6). The covariance matrix C(0) is now 
transformed into Cw(0) with the structure shown in Figure 2 where each of the blocks = 
WjC(0)'Wj for all = t. This implies that the entries of the matrix are formed from 

all the interactions of the MB vectors between level i and j. Thus for any vectors 

there is a unique entry of d^ of the form C{6)tp^j\ The blocks dd where i = —1 or 

j = —1, correspond to the case where the accuracy term p > p. 








^ — 1 /^ 




Figure 2: Organization of the multi-level covariance matrix Cw(0). For this figure i = —1. 

The following lemma relates the covariance function (p, the degree / (corresponding to the 
accuracy parameter p) of the design matrix My to the decay of the entries of the matrix Cw(0). 

Lemma 1 Let be the smallest ball in with radii centered around the midpoint a G R^ of the 
cube B| such that B| C Bg. Similarly, let Bj, be the smallest ball in R^ with radii rj, G R^ centered 
around the midpoint b of the cube B^ such that C Bb- Now, since d/ satisfy the moment 
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orthogonality condition from equations (8) and (9) for all g G 7^P(S) then the following bound holds: 

\{f<*)^c(0)^f\< E E KD^y<p{x.T.e)\, (10) 

\a\=f+l |/3|=/+1 P' xeBa,y6Bb 

fori,j = 0,...,t. 

From Lemma 1 we observe that the decay of the entries of Cw(0) is dependent on the magni¬ 
tude of the derivatives of the covariance function (p{r; 6), the size of and Bb and the degree 
of Q‘j. Thus if (p{r-, 6) is smooth on and Bb the entries of Cw(0) will be small. 

Example 1 In Figure 3 we show a comparison between (a) the covariance matrix C(0) and (b) the 
multi-level covariance matrix Cw(0)/or the following example: 1) (p{r; 6) := exp(—r) and d = 3. 2) 
The observation locations (n = 8000) are sampled from a uniform distribution on the unit cube [0,1]^. 
The actual values of the observations are not necessary for this example. 3) f = 3 (leading to p = 20 
monomials). 4) We sort the Xi direction location from 0 to 1. This is done for visualization reasons 
so that we may observe the decay in the matrix C(9). Notice that the decay ofC{6) is dependent on 



1000 2000 3000 4000 5000 6000 7000 8000 1000 2000 3000 4000 5000 6000 7000 


(a) logio abs(C(0)) (b) log^Q abs(Cw(B)) 

Figure 3: Covariance matrix comparison between covariance matrices (a) log^Q abs(C(0)) and 
(b) logjg abs(Cw(0)) for the exponential covariance function (p{r) = exp(—r) with n = 8000, 
p = 20 and d = 3. 
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the covariance function (p(r-d). It is clear that for this case a tapering technique would not be very 
effective as most of the entries are comparable in magnitude. In contrast a few of the entries ofCw{d) 
with high magnitude are concentrated around particular regions while most of the entries have very 
small magnitudes making a hierarchical tapering technique to sparsify the matrix a viable option. 


To produce a sparse matrix from Cw(0) we execute the following multi-level tapering 
technique: 

• For all cubes at level i let Vjf := and := ^ U {union of all cubes at level i 

that share a face or comer with l!^ for j = 0,1,.... A consfrucfion example is shown 
in Figure 4 for level i. Now, perform this construction for i = 0,... ,t. 


Bi 


Li,o 













U 


Figure 4: Construction of expanded cubes t = 0,1,... from initial cube B{. 

• Set a user given constant t G Nq 

• The entry of Cw(B) corresponding to C{6)tp^/, lor i, j = 0,..t is computed if the 

K I 

following level dependenf criterion is true: If {j > i and Bj C (j < i and B{ C Lj'^) 

is true for the given t G Nq then compute the entry . 

K I 

• For the case that i = —1 or j = —1 the entry corresponding to {xpb^yc{6)xp{'^ is always 

K I 

computed. 


13 






















From this distance criterion we can apriori determine which entries of the sparse matrix 
Cw(0) are to be computed. For any given row of fhe mafrix Cw(0) corresponding fo level i 
and index k consfruct the expanded cube Now, for j = i,... ,t find all the cubes with 
the corresponding index I that are contained in (See Figure 5). For j = 0,... ,i — 1 find all 
the extended cubes such that B^ C Lj'^. For i,j = 0,..., f this action can be performed 
efficienfly by using the tree shown in Figure 1. 






— 

— 

i 

— 














b; 


1 

4 

- 



Bj,. b;. 


C L!-'; t = 2. 


Figure 5: Example of finding all the boxes Bj that are contained in for t = 2. 

With this criterion we can produce a highly sparse matrix Cw(0) that is close to Cw(6) in 
the matrix 2 -norm sense. 

Remarks The error ||Cw(B) — Cw(B )||2 be monotonically decreasing with respect to the 
smoothness of the covariance function, the size of the degree of the multi-level basis f > f (accuracy 
parameter p > p) and the size ofr. For a sufficiently large t and f the error ||Cw(B) — Cw(B) II 2 
be small and the matrix Cw(B) becomes positive definite. 

Now, the number of nonzeros ofCw{d) will increase as we increase r and p. To be able to determine 
the size for t and p > p it is helpful to derive an expression for the error ||Cw(B) — Cw(0) II 2 vs the 
number of non zeros ofCw{6). 
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Error estimates can be derived for || Cw(0) — Ciy(0) ||2 with respect to the smoothness of the covari¬ 
ance function, p and T, but this is beyond the scope of the present paper. In practice for the polynomial 
based model Q‘j we set the level dependent criterion parameter t := 1 and increase f (and p) until at 
least Cv\f(6) is positive definite. Moreover, the sparse Cholesky factorization code in the Suite Sparse 
package (Chen et al. (2008); Davis and Hager (2009, 2005, 2001, 1999)) that is used in this paper 
informs the user if the matrix is not positive definite. 

In Castrillon-Candas et al. (2013) the authors described how to apply a Kernel Indepen¬ 
dent Fast Multipole Method (KIFMM) by Ying et al. (2004) to compute all the diagonal blocks 
for i = 0,...,t in 0{nt) computational steps to a fixed accuracy epM > 0- This 
approach can be easily extended to compute all the blocks Cf{l{6) for i,j = — 1 ,...,f in 
0{n{t + 1)^). 

Remark 4 The KIFMM by Ying et al. (2004) is very flexible as it allows a large class of covariance 
functions to be used including the exponential, Gaussian and Matern. However, the computational 
efficiency is mostly dependent on the implementation of the covariance function (since the KIFMM 
computational cost is 0(n)) and the accuracy parameter of the solver. For all the numerical experi¬ 
ments in this paper the accuracy parameter is set to medium (10~^ to 10 “®) or high ( 10 “® or higher). 

Due to the lack of a fast math C-i-i- library for the Matern covariance function, we create a Hermite 
cubic spline interpolant of the covariance function with the multithreaded Intel Math Kernel Library 
(MKL) data fitting package. To generate a compact representation of the interpolant we implement 
an h-adaptive mesh generator in ID such that the absolute error over the range (0,2.5] is less than 
TOE. From Elden et al. (2004) given that the covariance function (p(r;6) G C'^(R), r G IR, on each 
mesh element (starting at xq G R) with length h we can guarantee that the absolute error for the cubic 
Hermite interpolant is less than TOE max^^j^^ (p^^fx; 0) < FOE, where refers to the 
fourth derivative with respect to x. In this work we set TOE = 5 x 10“^. Numerical test confirmed 
TOE accuracy for the Matern covariance function with less than 200 adaptive mesh nodes. This is 
sufficient for the numerical examples in this paper. 
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In Figure 6 we show an example of a sparse matrix produced for t = 1 for n = 8, 000 
observation locations sampled from a uniform distribution on the unit cube. Notice that the 
entries of the matrix that are not covered by the sparsity pattern are around 10“^ times smaller, 
implying the hierarchical sparsity technique will lead to good accuracy 

The total sparsity for this example is 46% (23% since the matrix Cw{d) is symmetric), 
however, the sparsity density improves significantly as n increases as we expect the number of 
non-zero entries of Cw{d) to increase at most as 0{{t + 2)^n) with the number of observations 
n (See Castrillon-Candas et al. (2013)). 

In Figure 7 the sparsity pattern of the matrix Cw(0) is shown for n = 64,000 observation 
locations sampled from a uniform distribution on the unit cube. For this case the design 
matrix is constructed from p = 20 monomials (i.e. up to cubic polynomials) and T = 1. 
The sparsity of this example is 8.2% (4.1 % since the matrix Cw(0) is symmetric). 




1000 


2000 


3000 


4000 


5000 


6000 


7000 


0 1000 2000 3000 4000 5000 6000 7000 


2 


0 


-2 


■4 



-6 

-8 

-10 


-12 


■14 


-16 


(b) Overlayed sparsity pattern in blue 


Figure 6: Sparsity pattern overlayed on Cw{d). Notice that most of the entries that are not 
covered by the blue boxes are around 10“^ times smaller in magnitude than the covered en¬ 
tries. 
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5 Multi-Level Estimator 


As the result section shows it is not necessary to compute the entire sparse matrix Cw{d) to 
obtain a good estimate of the covariance parameter 6. Due to the multi-resolution properties 
of the MB we can construct a partial multi-resolution likelihood function that is effective. 

We can produce a series of multi-resolution likelihood functions 
applying the partial transform [Wj,..., W^] to the data Z, thus 

where 7Jy^ := [W^,..., Wl]^Z, n is the length of 7}-^ and C\^{6) is the n x n upper-left sub¬ 
matrix of Cw(0). 

5.1 Computation of logdet{CjY(0)} 

Since C(0) is symmetric positive definite from (7) it can be shown that Cw(0) is also symmet¬ 
ric positive definite. It can also be shown that for a sufficiently large T and/or p the matrix 



0 


2 3 4 

nz = 335352336 


5 6 

X 10'' 


Figure 7: Sparsity pattern (8.2 % non zeros) for Cw with t = 1 and n = 64,000. 
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C\^{d) will also be symmetric positive definite. An approach to computing the determinant of 
C\^{d) is to apply a sparse Cholesky factorization technique such that GG^ = C\^{6) where 
G is a lower triangular matrix. Since the eigenvalues of G are located on the diagonal we 
have that log det{C*^(0)} = 2 log Gu- 

To reduce the fill-in of the factorization matrix G we apply the matrix reordering technique 
in Suite Sparse 4.2.1 package (Chen et al. (2008); Davis and Hager (2009, 2005, 2001, 1999)) 
with the Nested Dissection (NESDIS) function package. The sparse Cholesky factorization is 
performed with the Ichol command from Suite Sparse 4.2.1 package. 

Although in practice the combined NESDIS and sparse Cholesky factorization is highly 
efficient, as shown by our numerical results, a worse case complexity bound can be obtained. 
Eor example, it can be shown that by using the planar graph separation theorem (see George 
(1973), Gilbert and Tarjan (1987)) a worse case complexity of and 0(n(logn)^) stor¬ 

age is achieved in 2D. Similarly, the worse case complexity in 3D is 0{rp-). 

Example 2 Continuing Example 1 we compute logdet{Cw(0)} and the approximation 
log det{Cviiid)} for t = 0, 1,2, oo by applying the sparse Cholesky factorization. In Table 1 we tabu¬ 
lated the absolute Cahs '■= \ logdet{Cw(0)} — logdet{Cw(0)}| and relative e^ei '■= |iogdet{Cw(6)}| 
errors. For t = 0 we obtain a very sparse matrix (4% density), but leads to a non-positive definite 
matrix, which is not valid for the computation of the determinant. For t = 1 the matrix Cwid) be¬ 
comes positive definite. As we increase t the approximation log det{Cw(0)} becomes more accurate. 
However, the density ofCw{6) also increases. 

Table 1: Log determinant errors comparisons between Cw(0) and Cw(0). 


T 

density (%) 

log det{Cw(0)} 

^abs 

^rel 

0 

1 

4 

23 

not positive definite 
-5.184354 X 10^ 

2.23 X 10-2 4.31 X 10-6 

2 

38 

-5.184332 X 10^ 

3.78 X 10-4 7.29 x 10-® 

oo 

50 

-5.184331 X 10^ 

0 

0 
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5.2 Computation of (Z^)'^C^(0) 

We have two choices for the computation of We can either use a Cholesky 

factorization of or Preconditioned Conjugate Gradient (PCG) coupled with a KIFMM. 

The PGG choice requires significantly less memory and allows more control of the error. How¬ 
ever, we already computed the sparse Gholesky factorization of for the computation 

of the determinant. Thus we can use the same factors to compute 


6 Multi-Level Kriging 


An alternative formulation for obtaining the estimate Z(so) is by solving the following prob¬ 
lem 


C(0) M/W-yA/zA 
Mj 0 ) \ ^ ) ~ \ 0 ) ' 


( 12 ) 


It is not hard to show that the solution of this problem leads to equation (4) and 7 ( 0 ) = 
C“^(0){Z — or alternatively jS = (MjMy)“^My(Z — C{6)-y). The best unbiased 

predictor is evaluated as 

Z(so) = m(so)^jS(0) + c(0)^7(0) (13) 


and the Mean Squared Error (MSE) at the target point sq is given by 


1 + u^(M}C(0)“iMy)“iu - c(0)^C“i(0)c(0) (14) 

where u"*" := (MyC“^(0)c(0) — m(so)). 

The computational cost for computing jS( 0 ), 7(0) and the MSE accurately using a direct 
method is 0{n^), which is unfeasible for large size problems. We propose a much faster 
approach. 

From (12) we observe that MJ7(0) = 0. This implies that 7 G R”\7^P(S) and can be 
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uniquely rewritten as 7 = W^ 7 ^ for some 7 ^ G R” P. We can rewrite C( 0)7 + M/iS = Z as 

C(0)wT7^ + M^jS = Z. (15) 

Now apply the matrix W to equation (15) and we obtain W{C( 0 )W^ 7 ^ + MyjS} = WZ. 
Since WMy = 0 then Cw(0)7w = ^w- Applying the preconditioner D^^(0), where 
Dw(0) := diag(Cw(0)), we have the system of equations Cw(0)7w(0) = where 
7 ^( 0 ) := Dw( 0 ) 7 w(^)/Cw( 0 ) •= D^^( 0 )Cw( 0 )D^^( 0 ) and Zw := D^^( 0 )Zw. 

This system of equations is solved by a combination of a KIFMM and PCG. If Cw(0) and 
Dw( 0 ) are symmetric positive definite then an effective method to solve Cw( 0 ) 7 w( 0 ) = 
is the PCG method implemented in PETSc by Balay et al. (2013b,a, 1997). 

Lemma 2 If the covariance function cp is positive definite, then the matrix D w (0) is always symmetric 
positive definite. 

The matrix-vector products Cw(0)v, where v G are computed in 0{n) computa¬ 

tional steps to a fixed accuracy epM > 0. The total computational cost is 0{kn{t + 2)), where 
k is the number of iterations needed to solve Cw(0)7w(^) = Zw to a predetermined accuracy 

£pcG > 0 - 

It is important to point out that the introduction of a preconditioner can degrade the per¬ 
formance of the PCG, in particular, if the preconditioner is ill-conditioned. The accuracy of 
the PCG method epcG has to be set such that the accuracy of the unpreconditioned system 
Cw ( 0 )avv( 0 ) = Zw is below a user given tolerance e > 0 . 

We compute 7 = W"'’ 7 ^ and jS = — C(6)j) in at most 0{np^ + p^) 

computational steps. The matrix vector product c( 0 )^ 7 ( 0 ) is computed in 0{n) steps. Finally, 
the total cost for computing the estimate Z(so) from (13) is 0{np^ p'^ kn(t -|- 2 )). 

In Appendix C we show a procedure to compute the MSE fast. 
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7 Numerical Study and Statistical Examples 


In this section we test the numerical efficiency and accuracy of our solver for computing the 
terms logdet{C^Y(0)} and 7vv(^) = for Matem covariances. Our results show 

that we are able to solve problems of up to 128,000 observations and krigrng up to 512,000 size 
problems with good accuracy Our approach is not limited to 128,000 for parameter estima¬ 
tion. This was the maximum we could test due to the memory limitation on our workstation 
in creating observations larger than 128,000. We now describe the data sets. 

Data set #1 and #2: The sets of observation locations Sf ,..., S^q vary from 1,000 to 512,000 
and we assume that Sf C ^ for I = 1,..., 9 for d = 2 and d = 3. The observations locations 
are sampled from a uniform distribution over the unit square [0,1]^ for d = 2 (data set #1) 
and for [0,1]^ for d = 3 (data set #2), as shown in Figure 8. The target points sq are set to 1000 
random points across the domain [0,1]^ (data set #1) and [0, l]^(data set #2). We shall refer to 
Zf,... Zfg as the observation values associated with Sf,... S^g. 

Data set #3: We take the data set generated by Sg for d = 2 (256,000 observation points) 
and carve out two disks located at (1/4,1/4) and (3/4,3/4) with radii 1/4. This generates 
100,637 observation points; see Figure 8 (c) for an example with 1,562 observation points ran¬ 
domly extracted from the data set. 

We now test our approach on the Matern covariance function (p{r; 6) := y(v)2''-^ 

Ky where F is the gamma function and Ky is the modified Bessel function of the 

second kind. All results are executed on a single CPU (4 core Intel 17-3770 CPU @ 3.40GHz.) 
with Linux Uhuntu 13.04. 

7.1 Parameter Estimation 


In this section we present the results for data set #1 and #3 for the Matern parameter estima¬ 
tion. 

Suppose we have two realizations of the Gaussian spatial random field Z(s) with the 


21 




Figure 8 : Data set examples: (a) Data set #1: One thousand observation points randomly 
generated from a uniform distribution on [0,1]^. (b) Data set #2: One thousand observation 
points randomly generated from a uniform distribution on [0,1]^. The color of the observation 
locations represent the distance along the right axis coordinate, (c) Data set #3: Two disks of 
1562 randomly generated observation locations. The disks are contained in a square but are 
represented in a multi-level representation. 

Matem covariance for data set #1 (2D). We set / = 3 (p = 10) and / = 4,5 ,6 (corresponding 
to p = 15,21,28) and fix the covariance parameters to 0 = (v,p) = (3/4,1/6). Two realiza¬ 
tions (n = 64,000) and (n = 128,000) are generated from these parameters. For each 
observation values Z^ and Z 7 (and locations) apply the transformation W to compute Z^ ^ 
and Z^ 7 and solve the optimization problems 0; := argmax^^j-Q ^ (Z^ ■, 0) for i = 6 

and 7. 

The optimization problem from the log-likelihood function ( 6 ) is solved using fminsearch 
from the optimization toolbox in MATLAB with the local minimizer search for v in the in¬ 
terval [1/2,5/4] and p in the interval [1/7,1/5]. We set the parameter criterion to T := 1, 
and the fminsearch tolerance is set to 10“^. In Table 2 we tabulate the results for the parameter 
estimates v and p for different problem sizes of data sets #1 and #3 for the user defined param¬ 
eters / for the construction of the MB and i for the construction of the matrix C(y. We notice 
that the estimates of (v,p) seem to approach the actual values as we increase the number of 
observations. In particular, for n = 128,000 the estimate (v,p) := (0.7498,0.1672) is very 
close to the actual noise model parameters (v, p) = (0.75,1/6) of the covariance function. We 
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observe that as we increase the number of levels (i.e. decrease i) in the covariance matrix the 
absolute error decays until it stagnates, usually by the time that the covariance matrix is for 
two levels. We also report the wall clock times (i.e. actual time it took the executable to run. 


Table 2: Estimation results for data sets #1 and #3. The observation data is generated with 
covariance parameters v = 0.75 and p = 1/6. The degree of the model is / = 3 (cubic), which 
gives p = 10 monomials and we set t := 1. For all the experimental runs the number of MB 
levels is f = 6. The first column is the size of the problem. Columns 2 and 3 are the parameters 
that are chosen to build the MB and the multi-level estimator. Column 4 is the corresponding 
p for / and the maximum level. Columns 5 and 6 are the errors of v — v and p — p- Column 7 
is the percentage of non-zeros of the Cholesky factors. Column 8 is self-explanatory. Column 
9 and 10 are the approximate wall clock computational time (in seconds) needed to compute 
C\^{d) and to perform Cholesky factorization respectively. The total time for each Newton 
iteration is about icons(s) -l- tchoi{s)- For each problem it takes about 50 Newton iterations to 
converge. Estimation results for data set #3 with observation data generated with covariance 
parameters v = 1.25 and |0 = 1/6. 


Estimation results for data set #1 (2D). 


n 

/ 

i 

P 

V — V 

p-p 

nz(G)(%) 

size(C(v) 

icons (s) 

i-chol (^) 

64,000 

6 

6 

28 

-0.0759 

0.0333 

8.9 

23 

14 

0 

64,000 

6 

5 

28 

0.0182 

-0.0132 

1.7 

35328 

40 

1 

64,000 

6 

4 

28 

-0.0043 

0.0046 

4.5 

56832 

230 

11 

64,000 

6 

3 

28 

-0.0049 

0.0048 

10.7 

62208 

961 

65 

64,000 

5 

6 

21 

0.0071 

-0.0146 

0.4 

810 

13 

0 

64,000 

5 

5 

21 

0.0037 

-0.0027 

1.7 

42496 

43 

2 

64,000 

5 

4 

21 

-0.0030 

0.0048 

3.7 

58624 

220 

12 

64,000 

5 

3 

21 

-0.0048 

0.0046 

7.0 

62656 

750 

32 

64,000 

4 

6 

15 

-0.0080 

0.0098 

0.2 

7749 

13 

0 

64,000 

4 

5 

15 

-0.0047 

0.0043 

2.0 

48640 

53 

3 

64,000 

4 

4 

15 

-0.0068 

0.0062 

3.4 

60160 

161 

9 

64,000 

4 

3 

15 

-0.0051 

0.0048 

4.4 

63040 

550 

15 

128,000 

6 

6 

28 

0.0010 

- 0.0011 

0.3 

17179 

75 

0 

128,000 

6 

5 

28 

0.0025 

- 0.0020 

2.1 

99328 

350 

13 

128,000 

6 

4 

28 

- 0.0002 

0.0005 

4.0 

120832 

1200 

70 

128,000 

5 

6 

21 

- 0.0010 

0.0015 

0.5 

42154 

80 

0 

128,000 

5 

5 

21 

0.0004 

- 0.0002 

1.9 

106496 

300 

14 

128,000 

5 

4 

21 

-0.0016 

0.0017 

3.3 

122624 

1000 

50 





Estimation results for data set #3 (2D). 



n 

/ 

i 

P 

V — V 

p-p 

nz(G)(%) 

size(C'^) 

icons (s) 

i-chol (^) 

100,637 

6 

6 

66 

0.0548 

-0.0237 

0.5 

2613 

60 

0 

100,637 

6 

5 

66 

-0.0031 

0.0020 

3.1 

72231 

600 

12 
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not to be confused with CPU time that is unreliable as a measure) for computing each Newton 
iteration. The total number of Newton iterations is approximately 50. 

In Table 2 the results for parameter estimation with data set #3 are tabulated. The real¬ 
ization is obtained from the Gaussian random field Z(s) with n = 100,637, v = 1.25 and 
|0 = 1 / 6 . For this case the absolute error is 0.25% for the estimate v and 0.04% for p. 

In Table 3 we generate M = 100 realizations of the stochastic model for data set #1, to 
analyze the mean and standard deviation of the Matern covariance parameter estimates. The 
mean estimate ]Em[v] refers to the mean of M estimates v for the the M realizations of the 
stochastic model. Similarly, stdM[v] refers to the the standard deviation of the M realizations. 
We observe that the mean appears to approach the covariance parameters as we decrease i. 
As i is reduced from f to f — 1 there is a significant drop in the term stdM[v]. However, for 
i < t — 1 the standard deviation stdM [v] does not improve significantly. Therefore, there is 
not much gain in improving the estimate by decreasing i, which increases the computational 
cost in computing C\^. 


Table 3: (a) Statistical results for data set #1 with observation data generated with covariance 
parameters v = 0.75 and p = 1/6. The parameter M = 100 is the number of realizations of 
the stochastic model, ~ is the bias of M estimates of v and stdM[v] is the standard 

deviation of M estimates of v. As i is reduced from f to f — 1 there is a significant drop in 
stdM[v]. However, for i < t — 1 the standard deviation of the estimates does not improve 
significantly. This indicates that a good estimate can be obtained for i = t — 1 and there is not 
much gain in reducing i, which increases the computational cost in computing 


Statistical results for data set #1 with multiple realizations. 


n 

/ 

i 

P 

t 

Em[v - V 

^^Mlp-p] 

stdj^/i V 

stdi^lp 

32,000 

4 

5 

15 

5 

-6.0 X 10-4 

1.0 X 10-^ 

1.3 X 10-^ 

1.0 X 10-^ 

32,000 

4 

4 

15 

5 

-7.2 X 10-4 

7.0 X 10-4 

5.9 X 10-3 

4.5 X 10-3 

32,000 

4 

3 

15 

5 

-7.0 X 10-4 

6.0 X 10-4 

5.6 X 10-3 

4.0 X 10-3 

64,000 

4 

6 

15 

6 

6.8 X 10-4 

1.1 X 10-^ 

2.0 X 10-^ 

1.9 X 10-^ 

64,000 

4 

5 

15 

6 

7.4 X 10-4 

-5.6 X 10-4 

5.4 X 10-3 

4.6 X 10-3 

64,000 

4 

4 

15 

6 

2.5 X 10-4 

-1.7 X 10-4 

3.9 X 10-3 

3.3 X 10-3 

128,000 

6 

6 

28 

6 

-1.3 X 10-^ 

1.5 X 10-^ 

8.3 X 10-3 

7.7 X 10-3 

128,000 

6 

5 

28 

6 

-6.2 X 10-4 

6.5 X 10-4 

3.7 X 10-3 

3.3 X 10-3 
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7.2 Numerical examples for computing Cw(0) and Kriging 


We test our approach for solving the system of equations Cw(0)7vv = (that we have 
to solve to obtain the kriging predictor) on the data sets #1 and #2. We also include results 
showing the kriging prediction error between the multi-level and direct methods. 

We first test the PCG method with data set #2 (3D) on three test cases: (a) da = (v,p) = 
(3/4,1/6), (b) Oh = (1,1/6) and (c) dc = (5/4,1/6). The value p = 1/6 gives us an approx¬ 
imate decay of 5% (which is reasonable in practice) from the center of the cube along each 
dimensional axis. The PCG relative error tolerance epsilonpcc > 0 is set to a value that leads 
to a relative error e = 10 “^ of the unpreconditioned system Cw( 0 ) 7 w = 

In Table 4 we report the total wall clock times and iterations for computing jS, 7 and the 
target Z(so) for data set #2 (3D) with the Matern covariance function . The polynomial ac¬ 
curacy of the model is set to cubic (/ = 3, p = 20) and the accuracy parameter p is set to 
20 (which corresponds to / = 3). We look at three cases: For (a) {da = (3/4,1/6)) we set 
the KIFMM accuracy to medium and the number of iterations increase as G(n®-^®). For (b) 
{dh = (1,1/ 6 )) we set the KIFMM accuracy to medium and the number of iterations increases 
as 0{n^'^^). For (c) {dc = (5/4,1/6)) we set the KIFMM accuracy to high and the number of 
iterations increases as 0{n^'^). 

In Table 4 we also report the number of iterations needed for solving C“^(0)Z with 10“^ 
accuracy with a GG method. In this case the number of iterations is about 10 times larger 
than the multi-level version. Moreover, for solving the kriging problem (e.g. equation (4)), p 
such problems have to be solved. Thus, it is at least about 200 times faster since p = 20 for 
this case. An alternative is to solve (12). However, in general it is not positive definite. The 
matrix is highly ill-conditioned also making it difficult to solve with an iterative solver such 
as generalized minimal residual method (see Gastrillon-Gandas et al. (2013)). 

In Table 5 the results for computing jS, 7 and Z(so) for 1000 target points sq for data set 
#1 (2D) with the Matern covariance function are tabulated. We have three test cases: (a) 
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Table 4: Diagonal pre-conditioned results for computing Cw(0)7py = for data set #2 
(3D) with the Matern covariance 6 = {v,p). We look at three cases: (a) 6a = (3/4,1/6) (b) 
01, = (1,1/6) and (c) 6c = (5/4,1/6). The relative error of the residual of the unprecondi¬ 
tioned system is set to e = 10“^. The KIFMM is set to medium accuracy for (a) and (b), and 
set to high accuracy for (c). The second column is the number of iterations needed to obtain 
10“^ relative error of the unpreconditioned system with Cw(0). We denote as itr(Cw) as the 
number of CG iterations needed for convergence until the desired accuracy is achieved. The 
third column is the number of iterations for solving C“^(0)Z with 10“^ accuracy. The fourth 
column is the residual tolerance needed for convergence of 10“^ relative error for the unpre- 


conditioned system. The fifth column presents the wall clock times for initialization (basis 
construction and preconditioner computation). The PCG iteration wall clock times for Cyy 
are given in the sixth column. The last column presents the total wall clock time to compute 
7W = Cvy( 0 )“^Zvy. 

(a) Oa = (3/4,1/6), d = 3j = 3ip = 20), f = 3 {p = 20) 
n itr(Cw) itr(C) epcG Diag. (s) Itr (s) Total (s) 

16,000 

166 

1296 

1.02 X 10“^ 

80 

113 

193 

32,000 

247 

3065 

9.88 X 10“^ 

215 

321 

536 

64,000 

372 

5517 

1.00 X 10-4 

665 

1226 

1891 

128,000 

547 

- 

4.84 X 10-5 

2060 

3237 

5397 

256,000 

847 

- 

5.00 X 10-5 

5775 

9885 

15660 

512,000 

1129 

- 

3.74 X 10-5 

17896 

33116 

51012 

(b) 0 fc = (l,l/ 6 ),d: 

CO 

II 

CO 

II 

= 20 ),/ = 

3{p = 

20 ) 

n 

itr(Cw) 

itr(C) 

SPCG 

Diag. (s) 

Itr (s) 

Total (s) 

16,000 

293 

2970 

8.23 X 10-5 

79 

198 

277 

32,000 

470 

7786 

8.18 X 10-5 

213 

607 

820 

64,000 

760 

15808 

7.09 X 10-5 

662 

2495 

3157 

128,000 

1167 

- 

3.00 X 10-5 

2050 

7109 

9159 

256,000 

1961 

- 

3.27 X 10-5 

5789 

22878 

28667 

(c) 0, = (5/4,l/6),d 

CO 

II 

c<5 

II 

= 20 ), / = 

= 3{p = 

: 20 ) 

n 

itr(Cw) 

itr(C) 

SPCG 

Diag. (s) 

Itr (s) 

Total (s) 

16,000 

500 

5953 

6.27 X 10-5 

138 

580 

718 

32,000 

827 

17029 

7.29 X 10-5 

346 

1574 

1920 

64,000 

1567 

37018 

4.45 X 10-5 

910 

6474 

7384 

128,000 

2381 

- 

2.23 X 10-5 

3974 

25052 

29026 

256,000 

4299 

- 

2.61 X 10-5 

10322 

72374 

82696 


6a = (v,p) = (1/2,1/6) (note for this case we obtain an exponential covariance function), (b) 
01, = (3/4,1/6), and (c) 6c = (1,1/6). For (a) and (b) the KIFMM accuracy is set to medium. 
For (c) the KIFMM accuracy is set to high. For this case the relative residual accuracy for the 
unpreconditioned system is fixed at 10“^. 
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Table 5: Diagonal pre-conditioned results for computing Cw{d)yY^ = Z^v for the Matern co- 
variance with 6 = (v,p) for data set #1 (2D). We look at three cases: (a) da = (3/4,1/6), (b) 
01, = (1,1/6) and (c) 6c = (5/4,1/6). The relative error of the residual of the unprecondi¬ 
tioned system is set to e = 10“^. The KIFMM is set to medium accuracy for (a), (b) and set to 
high accuracy for (c). 


(a) 0 

« = (l/2,l/6),d 

CO 

II 

(N 

II 

= 10),/ = 

12 ip = 

91) 

n 

itr(Cw) 

itr(C) 

£pCG 

Diag. (s) 

Itr (s) 

Total (s) 

16,000 

330 

3603 

' 239 ^( 10 ^ 

246 

115 

361 

32,000 

333 

5429 

1.39 X 10-3 

750 

251 

1001 

64,000 

455 

8152 

1.32 X 10-3 

1947 

589 

2536 

128,000 

564 

- 

7.10 X 10-4 

5570 

1577 

7147 

256,000 

619 

- 

9.78 X 10-4 

15266 

3065 

18331 

512,000 

1230 

- 

4.50 X 10-4 

42254 

13101 

55355 

(b) 0 , 

= (3/4,1/6), d = 

= 2,/ = 14 (p 

= 120 ),/ = 

= 14 (p = 

= 120 ) 

n 

itr(Cw) 

itr(C) 

£PGG 

Diag. (s) 

Itr (s) 

Total (s) 

16,000 

965 

26795 

2.78 X 10-3 

370 

397 

767 

32,000 

1110 

41079 

2.04 X 10-3 

1125 

1061 

2186 

64,000 

2239 

82166 

1.35 X 10-3 

2892 

3714 

6606 

128,000 

3443 

- 

1.09 X 10-3 

8268 

13130 

21398 

256,000 

4557 

- 

7.63 X 10-4 

23175 

30302 

53477 


(c) Be = (1,1/6), d = 2j = U{p = 120), / = 14 (p = 120) 
n itr(Cw) ih(C) £pcG Diag. (s) Itr (s) Total (s) 


16,000 

2710 

> 100,000 1.90 X 10-^ 

553 

1844 

2397 

32,000 

4261 

- 1.43 X 10-3 

1522 

5713 

7235 

64,000 

8801 

- 1.00 X 10-4 

5022 

23785 

28807 

128,000 

14405 

- 7.28 X 10-4 

12587 

75937 

88524 


We note that for this case the results are even more impressive than for the 3D case. For 
Table 5 (c) the CG solver stagnated and we terminated the iteration after 100,000. At this point 
the matrix C(d) is highly ill-conditioned. In contrast, with Cw(0) we are still able to solve the 
problem even for 128,000 size problem. 

The diagonal preconditioner we use is one of the simplest. We plan to extend this approach 
to more sophisticated preconditioners such as block Symmetric Successive OverRelaxation 
(SSOR) (see Castrillon-Candas et al. (2013)) in the future. 

The residual errors are then propagated to the final estimate Z(so) around the same mag¬ 
nitude. However, as a final experiment in Table 6 we tabulate the relative I 2 error between the 
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multi-level kriging approach and the direct method for data set #2 with exponential covari¬ 
ance function exp{—6r), where 9 = 5.9915 and / = 3. The PCG tolerance is set to 10“^ and 
/ = 3. Notice that the error increases with n. This is expected since the unpreconditioned 
system error will degrade. 


Table 6: Tabulation of the kriging estimate relative I 2 error between the multi-level kriging 
approach and the direct method for Data Set #1 (3D) for 1000 target points. The covariance 
function is exp{—9r), where 9 = 5.9915. The polynomial bias term in the Gaussian model is 
set to / = 3 (p = 20). The solver tolerance is set to epcc = 10“^ and accuracy parameter is set 
to p = 20 (/ = 3). 


n 

I 2 Relative Error 

1,000 

1.53 X 10“^ 

2,000 

6.71 X 10-5 

4,000 

6.42 X 10-5 

8,000 

1.01 X 10-4 

16,000 

9.14 X 10-5 

32,000 

1.05 X 10-4 


8 Conclusions 

In this paper we developed a multi-level restricted Gaussian maximum likelihood method 
for estimating the covariance function parameters and the computation of the best unbiased 
predictor. Our approach produces a new set of multi-level contrasts that decouples the co- 
variance parameters from the deterministic components. In addition, the covariance matrix 
exhibits fast decay independently from the decay rate of the covariance function. Due to the 
fast decay of the covariance matrix only a small set of coefficients of the covariance matrix are 
computed with a level-dependent criterion. We showed results of our method for the Matem 
covariance with highly irregular placement of the observation locations to good accuracy. 

We are currently working on deriving error estimates of the kriging estimate and determi¬ 
nant computation with respect to the number of degrees of freedom n. We are also contem¬ 
plating extending our multi-level approach to multivariate random fields and cokriging (e.g. 
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Furrer and Genton (2011)). 

Our method also applies to non-stationary problems if the covariance function is differen¬ 
tiable to degree /. For example, if the covariance function changes smoothly with respect to 
the location. Lemma 1 still applies and the multi-level covariance matrix decays at the same 
rate as a stationary one. Now, even if the covariance function is non differentiable everywhere 
with respect to the location. Lemma 1 still applies, but at a lower decay rate. 

Note that we have not made direct comparisons with many of the approaches described 
in Section 1. These methods are very good at solving a particular type of problem i.e. grid¬ 
like geometries and/or compact covariance functions. For these situations we recommend 
using the approaches already developed in the literature as they work well and are easier to 
implement. However, to our knowledge we have no seen fast results where the placement of 
the observations are highly irregular and the Matem covariance function decays slowly. We 
plan to include some comparison for the spatial covariance paper we plan to write. 

Appendix A: Proofs 

Lemma 1: 

Since (p(r;d) is in C^+^(]R), then by Taylor's theorem we have that for every x G Ba 
(pix,y;e) = E|^|</ + R^(x,y;B), where (x - := (xi - fli)“i • • • - 

fld)“La! := ad • • • a^!, and R^t(x,y; 0) := D^^(a-F s(x - a),y; 0) for some s G 

[0,1]. Now, recall that tp~^ is orthogonal to V^iS) then Y2=\ = Eh=i 

K K K 

Roc (sh, y; 0)• Since is also orthogonal toV^ (S) then by applying Taylor's theorem centered 
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at b G 






h=l e=l 


EE E 


h=l e=l 


kl=/+i 


(sfe - a) 
a! 


E 

l/3h/+l 




D^D^(p(a + s{s}i - a),b + t{se - h);6)tf>b’^[h]tp^/[e 


-EE 

l«l=/+l|/5|=/+l 


— — \D^Dt(p(a + s{sh - a),b + t(se - b);0)| 


a! |6! 


E E V’f MV’f k 


r=l e=l 


j.; 


^ E E |C'xP’y</’(x,y;0)|, 

|a|=/+l |/3|=/+1 P' x6Ba,y6Bb 


for some s, f G [0,1]. The last inequality follows since from Schwartz' inequality Eh=i I I ^ 

= 1 and egi I <!>'{' [<■] I < \/n.i(<fi'w)" = 1- □ 

Lemma 2: Since the Matem covariance function (p{x, y; 0) is positive definite we have that for 
all v 7 ^ 0: 

n n 

^ ViVjC^e) = Y, > 0 , 

iy=i iy=i 

where C^'}{d) is the {i,j) entry of the matrix C(0). The diagonal terms of Cw are of the form 
This implies that 


(v-f )Tc(0)^>f > 0. 

Thus, will always be positive definite. □ 
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Appendix B: Notation 


Index for when the following are first defined, mentioned or reformulated. 


d e N 

Dimension of problem (pi) 

n e N 

Number of observations (pi) 

t e N 

Maximum MB level (p8) 

p e N 

Number of columns of M (pi) 

/e N 

Polynomial degree (p7) 

p e N 

Accuracy parameter of MB (p7) 

/e N 

Degree of multilevel basis. p(7) 

a; e N 

Dimension of 0 (pi) 

M e ]R"xp 

Design matrix (pi) 

H-) 

Covariance function (p7) 

S := {si, ■ . ■ ,Sn 

,} Locations of observations (pi) 

6 := {v,p) e ]R^ 

Param. of matern kernel (pi) 

C(0) e ]R"xn 

Covariance matrix (pi) 

ZeW 

Observation values (pi) 

So e 

Target points set (pi) 

pp(S) 

Span of the columns of M (pi) 


Vector of unknowns from 


Estimates 


deterministic model (pi) 

Z(so) 

kriging estimate at sq (p2) 


Log-likelihood function (pi) 

K 

Cube at level q and index k (p8) 

L 

]Rn ^ (p4) 

W 

R” ^ CPP(S))-L (p4) 

pT 

wT L^ (plO) 

ziv 

(Pl7) 




Multilevel log-likelihood (pi7) 

Qj 

Set of polynomial monomials 
of order / and dimension d (p7) 

T e No 

Level dependent 
criterion constant (pl3) 


Appendix C 


Using the approach developed in this paper we can compute the MSE at the target point sq. 


Now, since P^P = I wehavethatM)'C(0)-iMr = M|-P^'PC(0)-ip^'PMr = M|-P^'Kw(0)“^PMr, 


Tr»Ti 


\ — 1 t>Ti 


rTi5Ti 


\-li 


V' 


r 




Kw(0) := PK(0)PT, Kw(0) := 


Cw( 0 ) aw 

Sw( 0 ) 


‘w 


where Sw(0) := LC(0)LT g and 


aw(0) := WC(0 )lT g R(«-P)xp. pet Sw(0) := (Sw(0) - aJvCw(0)“^aw)“^ then 
r r' Ja\ — l I Ja\ — l c Ja\^T J 

Kw( 0 )“^ : = 


Cw(0) ^ + Cw(0) ^awSw(0)a^Cw(0) ^ -Cw(0) ^awSw(0) 
—Sw(0)a^Cw(0) ^ Sw(0) 


Given that = 0 then MJC(0) = (LMyr)^Sw(0)(LMyr). Following a similar argu¬ 

ment we have that equation (14) becomes 


1 + iji^M}{LMffSwid){LMf)vi - cJvK-i(0)cw, (16) 

where cw := Pc. By using matrix-vector products with the PCG method Sw(0) can be com¬ 
puted in 0{{k{t -|- 2) -|- l)np + p^). Thus the term (LMyr)^Sw(0)(LMy:) can be computed in 
0{{k(t -|- 2) -|- l)np -|- -|- (f -|- 2)pn). Now, by also using matrix-vector products the term 
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K^^(0)cw can be computed in 0{kn(t + 2) + p^). Thus the total cost for computing equation 
(16) is 0{{k{t + 2) + l)np + p"^ + (f + 2)pn). 
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